actor

According to the wiki page, we can drop these columns:

- standard_text_property
- count_text_property
- concat_names

| | pk_actor | concat_actr | concat_standard_name | begin_year | certainty_begin | notes_begin | end_year | certainty_end | notes_end | gender_iso | notes | fk_abob_type_actor | creator | creation_time | modifier | modification_time |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 20302 | 24321 | Actr24321 | Poilblan, Gustave | NaN | 1 | None | NaN | 1 | None | 1 | None | 104.0 | 11.0 | 2009-12-21 17:54:44.000 | 11.0 | 2013-12-18 15:24:16 |
| 33234 | 35095 | Actr35095 | Munier, Félicien | 1872.0 | 3 | None | 1926.0 | 1 | None | 1 | None | 104.0 | 11.0 | 2010-05-31 15:19:54.000 | 11.0 | 2013-12-18 15:24:16 |
| 58131 | 2004 | Actr2004 | Espinosa, Miguel de | 1580.0 | 3 | None | 1630.0 | 1 | None | 1 | None | 104.0 | 27.0 | 2008-11-09 00:11:18.000 | 50.0 | 2017-11-28 12:12:47 |
| 52506 | 48188 | Actr48188 | Triveri, Francesco Antonio - da Biella | 1631.0 | 1 | None | 1697.0 | 1 | None | 1 | None | 104.0 | 30.0 | 2014-03-16 09:18:33.900 | 50.0 | 2016-10-20 11:43:57 |
| 23887 | 54356 | Actr54356 | Luc, Jean André de | 1763.0 | None | 2 | 1847.0 | None | 2 | 1 | fb_import_20140911_3187 | 104.0 | 3.0 | 2014-09-11 22:23:09.190 | NaN | 2014-09-12 12:22:41 |
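The column removal that produced the preview above can be sketched as follows. This is a minimal illustration rather than the notebook's actual code: the tiny `actor` frame and the placeholder values in the dropped columns are assumptions made only so the snippet runs on its own.

```python
import pandas as pd

# Hypothetical miniature of the raw actor table (placeholder values).
actor = pd.DataFrame({
    "pk_actor": [24321, 35095],
    "concat_standard_name": ["Poilblan, Gustave", "Munier, Félicien"],
    "standard_text_property": [None, None],
    "count_text_property": [0, 0],
    "concat_names": ["", ""],
})

# The three columns the text lists as removable.
to_drop = ["standard_text_property", "count_text_property", "concat_names"]

# errors="ignore" keeps the call safe if a column is already absent.
actor = actor.drop(columns=to_drop, errors="ignore")
print(list(actor.columns))
```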
Some rows have been identified as rows that should not be imported (see this wiki page).
Number of rows before filter: 61556
Number of rows after filter: 59526 (2030 have been removed)
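A filter of this kind might look like the sketch below. The exclusion rule itself lives on the wiki page and is not reproduced here, so the snippet assumes it boils down to a set of primary keys to skip; `excluded_pks` and the sample keys are hypothetical.

```python
import pandas as pd

# Placeholder keys standing in for the real table.
actor = pd.DataFrame({"pk_actor": [1, 2, 3, 4, 5]})

# Assumption: the wiki page yields a set of primary keys that must be skipped.
excluded_pks = {2, 5}  # hypothetical values

before = len(actor)
# Keep only the rows whose key is NOT in the exclusion set.
actor = actor[~actor["pk_actor"].isin(excluded_pks)].reset_index(drop=True)
after = len(actor)

print(f"Number of rows before filter: {before}")
print(f"Number of rows after filter: {after} ({before - after} have been removed)")
```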
For now we are interested only in persons.
Persons are the rows whose fk_abob_type_actor column equals 104.
Number of actors whose type is not 104: 3
| | pk_actor | concat_actr | concat_standard_name | begin_year | certainty_begin | notes_begin | end_year | certainty_end | notes_end | gender_iso | notes | fk_abob_type_actor | creator | creation_time | modifier | modification_time |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 10340 | 59031 | Actr59031 | Forster, James | 1830.0 | 3 | 3 | 1930.0 | 3 | 3 | 1 | None | 106.0 | 81.0 | 2016-11-29 11:05:00.060 | 81.0 | 2016-11-29 11:05:00 |
| 28940 | 60660 | Actr60660 | Valjean, Jean | 1769.0 | 1 | None | 1833.0 | 1 | None | 1 | None | 106.0 | 122.0 | 2018-10-23 16:48:50.050 | 122.0 | 2018-10-23 16:48:50 |
| 46002 | 46914 | Actr46914 | Dieu (conception chrétienne) | NaN | 1 | None | NaN | None | None | 0 | None | 106.0 | 3.0 | 2013-07-04 11:43:15.990 | 3.0 | 2013-12-18 15:24:16 |
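The selection of persons can be sketched with a boolean mask, as below. The sample frame is a hypothetical stand-in; only the column name `fk_abob_type_actor` and the code 104 come from the text.

```python
import pandas as pd

# Hypothetical sample: two persons and one actor of another type.
actor = pd.DataFrame({
    "pk_actor": [24321, 35095, 59031],
    "fk_abob_type_actor": [104.0, 104.0, 106.0],
})

PERSON_TYPE = 104  # type code for persons, as stated in the text

# A missing type compares as False, so such rows count as non-persons.
is_person = actor["fk_abob_type_actor"] == PERSON_TYPE

not_persons = actor[~is_person]
print(f"Number of actors whose type is not {PERSON_TYPE}: {len(not_persons)}")

# .copy() avoids chained-assignment warnings in later cleaning steps.
persons = actor[is_person].copy()
```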
Columns contain:
Total number of rows: 59523
- "pk_actor": 0.00% empty - 59523 (100.00%) uniques (eg: 44895; 47015)
- "concat_actr": 0.00% empty - 59523 (100.00%) uniques (eg: Actr44895; Actr47015)
- "concat_standard_name": 0.00% empty - 56550 ( 95.01%) uniques (eg: Sainte-Mar...; Costantino...)
- "gender_iso": 0.00% empty - 3 ( 0.01%) uniques (eg: 1; 2)
- "creation_time": 0.00% empty - 34441 ( 57.86%) uniques (eg: 2012-04-08...; 2013-07-26...)
- "modification_time": 0.00% empty - 13973 ( 23.47%) uniques (eg: 2013-12-18...; 2016-10-21...)
- "creator": 0.01% empty - 88 ( 0.15%) uniques (eg: 43.0; 30.0)
- "modifier": 8.92% empty - 85 ( 0.14%) uniques (eg: 2.0; 30.0)
- "certainty_begin": 9.42% empty - 4 ( 0.01%) uniques (eg: 3; 1)
- "certainty_end": 14.48% empty - 5 ( 0.01%) uniques (eg: 3; None)
- "begin_year": 18.56% empty - 847 ( 1.42%) uniques (eg: 1870.0; 1506.0)
- "end_year": 50.66% empty - 819 ( 1.38%) uniques (eg: 1930.0; 1545.0)
- "notes_begin": 67.74% empty - 5 ( 0.01%) uniques (eg: 3; 2)
- "notes_end": 72.41% empty - 6 ( 0.01%) uniques (eg: 3; 4)
- "notes": 89.85% empty - 6012 ( 10.10%) uniques (eg: <p>Il s'ag...; None)
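A per-column summary like the one above could be computed roughly as follows. The helper name `profile` is hypothetical, and the sketch assumes that missing values are excluded from the unique count, which may differ from how the original report counted them.

```python
import pandas as pd

def profile(df: pd.DataFrame) -> pd.DataFrame:
    """Return, per column, the share of empty cells and the number of distinct values."""
    n = len(df)  # assumed to be non-zero
    rows = []
    for col in df.columns:
        empty_pct = df[col].isna().mean() * 100
        uniques = df[col].nunique(dropna=True)  # assumption: NaN/None not counted
        rows.append({
            "column": col,
            "empty_pct": round(empty_pct, 2),
            "uniques": uniques,
            "uniques_pct": round(uniques / n * 100, 2),
        })
    # Least-empty columns first, mirroring the ordering of the report.
    return pd.DataFrame(rows).sort_values("empty_pct").reset_index(drop=True)

# Hypothetical sample data.
sample = pd.DataFrame({
    "pk_actor": [1, 2, 3, 4],
    "begin_year": [1872.0, None, 1580.0, 1580.0],
})
report = profile(sample)
print(report)
```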
Based on the summary above, we parse each column into its most meaningful type.
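As an illustration of such parsing, the sketch below converts a few columns. Which type each column should receive is an assumption made here (nullable integers for years and user ids, timestamps for the time columns, strings for names), not a statement of what the notebook actually does.

```python
import pandas as pd

# Hypothetical sample using values from the preview tables.
actor = pd.DataFrame({
    "begin_year": [1872.0, None],
    "creator": [11.0, 27.0],
    "creation_time": ["2010-05-31 15:19:54.000", "2008-11-09 00:11:18.000"],
    "concat_standard_name": ["Munier, Félicien", "Espinosa, Miguel de"],
})

# Whole numbers stored as floats: the nullable "Int64" dtype keeps missing values.
for col in ["begin_year", "creator"]:
    actor[col] = pd.to_numeric(actor[col], errors="coerce").astype("Int64")

# Timestamps: unparseable entries become NaT instead of raising.
actor["creation_time"] = pd.to_datetime(actor["creation_time"], errors="coerce")

# Free text: the dedicated string dtype.
actor["concat_standard_name"] = actor["concat_standard_name"].astype("string")

print(actor.dtypes)
```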
Below we report the interesting findings for individual columns; the analysis is not exhaustive.
For some columns, we also update the values.
We observe that some gender values are undefined. According to ISO/IEC 5218, the value should be 0 (not known), 1 (male), 2 (female) or 9 (not applicable), so we replace undefined genders with 0.
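The gender replacement can be sketched as below. The sample series is hypothetical, and the sketch assumes that undefined values appear either as real missing values or as the literal string "None"; both are coerced to 0.

```python
import pandas as pd

VALID_GENDERS = {0, 1, 2, 9}  # codes allowed by the ISO standard

# Hypothetical sample mixing valid codes with undefined entries.
gender = pd.Series([1, 2, None, "None", 9], name="gender_iso")

# Anything that is not a number becomes NaN, then NaN becomes 0 ("not known").
as_number = pd.to_numeric(gender, errors="coerce")
clean = as_number.fillna(0).astype(int)

# Sanity check: list any code that is still outside the allowed set.
unexpected = clean[~clean.isin(VALID_GENDERS)]
print(clean.tolist(), "unexpected:", unexpected.tolist())
```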
We replace missing values with 0.
We replace missing values with 0.
All HTML tags, non-ASCII characters and newlines are removed.
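A minimal sketch of this clean-up, using only the standard library, is shown below. The function name `clean_note` is hypothetical, the regular expression is a simplification that does not handle every HTML edge case, and replacing tags and newlines with a single space (rather than deleting them outright) is a choice made here to keep words from running together.

```python
import re

TAG_RE = re.compile(r"<[^>]+>")  # simplistic tag matcher, enough for short notes

def clean_note(text):
    """Strip HTML tags, non-ASCII characters and newlines from a note."""
    if text is None:
        return None
    # Replace each tag with a space so adjacent words stay separated.
    text = TAG_RE.sub(" ", text)
    # Drop every character that has no ASCII representation.
    text = text.encode("ascii", errors="ignore").decode("ascii")
    # Collapse newlines and repeated whitespace into single spaces.
    text = re.sub(r"\s+", " ", text)
    return text.strip()

print(clean_note("<p>Il s'agit\nd'un <b>élève</b></p>"))
```

Note that accented letters are dropped entirely rather than transliterated, which follows the description in the text but loses information in names and French notes.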